library(ggplot2)
library(dtplyr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
houseprice_dataset <- read.csv("~/Desktop/Applied Statistics/houseprice_dataset.txt", sep=";")
d <- read.csv("~/Desktop/Applied Statistics/houseprice_dataset.txt", sep=";")

Part 1

Numerical summary

summary(d)
##     n.rooms          build.year    postalcode        square_meters  
##  Min.   :  1.000   Min.   :1900   Length:1011        Min.   : 56.0  
##  1st Qu.:  2.000   1st Qu.:1960   Class :character   1st Qu.: 98.3  
##  Median :  3.000   Median :1991   Mode  :character   Median :110.3  
##  Mean   :  3.247   Mean   :1982                      Mean   :112.5  
##  3rd Qu.:  4.000   3rd Qu.:2007                      3rd Qu.:124.7  
##  Max.   :222.000   Max.   :2023                      Max.   :213.7  
##  swimpool_w1km     school_distance      type           sold_within1week
##  Min.   :0.00000   Min.   :0.0010   Length:1011        Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.2035   Class :character   1st Qu.:0.0000  
##  Median :0.00000   Median :0.4480   Mode  :character   Median :0.0000  
##  Mean   :0.04649   Mean   :0.6094                      Mean   :0.3858  
##  3rd Qu.:0.00000   3rd Qu.:0.7670                      3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :5.3490                      Max.   :1.0000  
##  selling_price   
##  Min.   : 32.71  
##  1st Qu.: 55.29  
##  Median : 64.97  
##  Mean   : 67.77  
##  3rd Qu.: 76.34  
##  Max.   :234.36
head(d)
##   n.rooms build.year postalcode square_meters swimpool_w1km school_distance
## 1       3       1938         A2          96.4             0           0.101
## 2       5       2018         A7         132.6             0           2.170
## 3       4       1995         A4         119.7             0           0.372
## 4       1       1900         A1          72.2             0           0.855
## 5       1       1984         A4          56.0             0           1.283
## 6       3       1949         A2          94.6             0           0.103
##        type sold_within1week selling_price
## 1 Apartment                0      43.27219
## 2     House                0      91.04185
## 3 Apartment                0      74.90384
## 4 Apartment                1      52.83142
## 5 Apartment                0      41.51371
## 6 Apartment                1      43.20737

Graphical representation

plot(d)

ggpairs(d) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Dimension of dataset

dim(d)
## [1] 1011    9

From the str() output, it can be seen that the variables sold_within1week and swimpool_w1km are supposed to be factors, but R coded them as integers.

str(d)
## 'data.frame':    1011 obs. of  9 variables:
##  $ n.rooms         : int  3 5 4 1 1 3 4 3 3 6 ...
##  $ build.year      : int  1938 2018 1995 1900 1984 1949 1966 2008 1943 1937 ...
##  $ postalcode      : chr  "A2" "A7" "A4" "A1" ...
##  $ square_meters   : num  96.4 132.6 119.7 72.2 56 ...
##  $ swimpool_w1km   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ school_distance : num  0.101 2.17 0.372 0.855 1.283 ...
##  $ type            : chr  "Apartment" "House" "Apartment" "Apartment" ...
##  $ sold_within1week: int  0 0 0 1 0 1 0 0 0 0 ...
##  $ selling_price   : num  43.3 91 74.9 52.8 41.5 ...

Task 1

From the data set, it can be seen that there are 1011 rows (observations) and 9 columns, named n.rooms, build.year, postalcode, square_meters, swimpool_w1km, school_distance, type, sold_within1week and selling_price.

The variables n.rooms, build.year, selling_price, school_distance and square_meters are numerical variables.

The variables postalcode, swimpool_w1km, sold_within1week and type are categorical variables in this dataset.

Task 2

Checking for duplicates in the dataset to clean

Checking for duplicated rows, I found that 11 rows are duplicated in this dataset.

nrow(d)
## [1] 1011
nrow(unique(d))
## [1] 1000
d_2 <- d[!duplicated(d), ]
dim(d_2)
## [1] 1000    9

Recoding the variables postalcode, type and sold_within1week as factors to see all levels or classes within each variable.

I noticed that the variables swimpool_w1km and sold_within1week are really yes/no indicators (1 = yes, 0 = no), but this was not evident in the numerical summary.

Likewise, the variable type takes only the values Apartment and House, which was also not evident in the numerical summary.

d_2$postalcode <- as.factor(d_2$postalcode)
d_2$type <- as.factor(d_2$type)
d_2$sold_within1week <- as.factor(d_2$sold_within1week)
summary(d_2)
##     n.rooms         build.year   postalcode square_meters    swimpool_w1km  
##  Min.   :  1.00   Min.   :1900   A1: 95     Min.   : 56.00   Min.   :0.000  
##  1st Qu.:  2.00   1st Qu.:1961   A2:171     1st Qu.: 98.22   1st Qu.:0.000  
##  Median :  3.00   Median :1991   A3:153     Median :110.40   Median :0.000  
##  Mean   :  3.25   Mean   :1982   A4:207     Mean   :112.46   Mean   :0.047  
##  3rd Qu.:  4.00   3rd Qu.:2007   A5:166     3rd Qu.:124.62   3rd Qu.:0.000  
##  Max.   :222.00   Max.   :2023   A6:114     Max.   :213.70   Max.   :1.000  
##                                  A7: 94                                     
##  school_distance         type     sold_within1week selling_price   
##  Min.   :0.0010   Apartment:850   0:614            Min.   : 32.71  
##  1st Qu.:0.2030   House    :150   1:386            1st Qu.: 55.22  
##  Median :0.4475                                    Median : 65.00  
##  Mean   :0.6112                                    Mean   : 67.77  
##  3rd Qu.:0.7680                                    3rd Qu.: 76.33  
##  Max.   :5.3490                                    Max.   :234.36  
## 

Part 2

Task 3

model_d_2 <- glm(data = d_2, sold_within1week ~ ., family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_d_2)
## 
## Call:
## glm(formula = sold_within1week ~ ., family = "binomial", data = d_2)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      5.168982   9.966027   0.519 0.603998    
## n.rooms         -0.402676   0.083625  -4.815 1.47e-06 ***
## build.year       0.000292   0.005137   0.057 0.954662    
## postalcodeA2    -3.476893   0.455915  -7.626 2.42e-14 ***
## postalcodeA3    -3.898849   0.456763  -8.536  < 2e-16 ***
## postalcodeA4    -3.350284   0.505090  -6.633 3.29e-11 ***
## postalcodeA5    -3.514069   0.537158  -6.542 6.07e-11 ***
## postalcodeA6    -2.940972   0.617407  -4.763 1.90e-06 ***
## postalcodeA7    -3.212262   0.623885  -5.149 2.62e-07 ***
## square_meters   -0.004179   0.005444  -0.768 0.442698    
## swimpool_w1km    1.189388   0.352865   3.371 0.000750 ***
## school_distance -0.531551   0.159205  -3.339 0.000841 ***
## typeHouse       -0.954252   0.258480  -3.692 0.000223 ***
## selling_price   -0.016438   0.007482  -2.197 0.028012 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1333.9  on 999  degrees of freedom
## Residual deviance: 1057.9  on 986  degrees of freedom
## AIC: 1085.9
## 
## Number of Fisher Scoring iterations: 6

Task 4

Given the input below, there is a 27.3% chance that an apartment with the characteristics provided will be sold within 1 week.

new_data <- data.frame(n.rooms = 3, build.year = 1990, postalcode = "A2", square_meters = 101, swimpool_w1km = 0, school_distance = 1.2, type = "Apartment", selling_price = 60)
predict(model_d_2, newdata = new_data, type = "response")
##         1 
## 0.2726939
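
To see what type = "response" is doing, the prediction can be reproduced by hand from the (rounded) coefficients in the Task 3 summary: the linear predictor is assembled from the estimates and passed through the inverse logit. A minimal sketch (coefficients copied from the output above, so the result matches only up to rounding):

```r
# Reproducing the Task 4 prediction by hand: predict(..., type = "response")
# simply applies the inverse logit to the linear predictor.
eta <- 5.168982 +                 # intercept
  (-0.402676) * 3 +               # n.rooms
  0.000292 * 1990 +               # build.year
  (-3.476893) +                   # postalcodeA2 dummy
  (-0.004179) * 101 +             # square_meters
  1.189388 * 0 +                  # swimpool_w1km (no pool)
  (-0.531551) * 1.2 +             # school_distance
  (-0.016438) * 60                # selling_price; type = Apartment is the baseline
p <- plogis(eta)                  # inverse logit: 1 / (1 + exp(-eta))
round(p, 4)                       # ~0.2727, matching predict() above
```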

Task 5

a

Yes, the seller would have a higher probability of 55.2% of selling the apartment if there is a swimming pool within 1 km of the building: without a pool the chance was 27.3%, and the presence of a pool increases it to 55.2%.

new_data_1 <- data.frame(n.rooms = 3, build.year = 1990, postalcode = "A2", square_meters = 101, swimpool_w1km = 1, school_distance = 1.2, type = "Apartment", selling_price = 60)
predict(model_d_2, newdata = new_data_1, type = "response")
##       1 
## 0.55191
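
The jump from 27.3% to 55.2% follows directly from the log-odds scale of the model: switching swimpool_w1km from 0 to 1 adds its coefficient (1.189388) to the log-odds, i.e. multiplies the odds of selling within a week by exp(1.189388), roughly 3.3. A small sketch using only the numbers reported above:

```r
# The pool indicator enters on the log-odds scale, so turning it on
# multiplies the odds by exp(coefficient).
p0    <- 0.2726939                 # Task 4 probability without a pool
odds0 <- p0 / (1 - p0)
odds1 <- odds0 * exp(1.189388)     # swimpool_w1km coefficient from Task 3
p1    <- odds1 / (1 + odds1)
round(p1, 3)                       # ~0.552, matching the prediction above
```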

What if the asking price is reduced to 30?

b

I found that the probability of selling the apartment within one week rises to 67% if the seller lowers the asking price to 30 million ISK. Hence, when the price is lowered, the chance of selling an apartment within a week increases.

new_data_2 <- data.frame(n.rooms = 3, build.year = 1990, postalcode = "A2", square_meters = 101, swimpool_w1km = 1, school_distance = 1.2, type = "Apartment", selling_price = 30)
predict(model_d_2, newdata = new_data_2, type = "response")
##         1 
## 0.6685264

Part 3

Task 6

set.seed(233)
training <- d_2[sample(1:nrow(d_2),size=(0.75*nrow(d_2)),replace=FALSE),]
test <- d_2[-which(row.names(d_2) %in% row.names(training)),]
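
The anti-join on row names works because R keeps the original row names when subsetting a data frame, so the sampled rows and the remainder partition the data exactly. A self-contained check of the same recipe on a stand-in data frame (d_mock is hypothetical, not the house-price data):

```r
# Stand-in data frame purely for illustration
set.seed(233)
d_mock <- data.frame(a = rnorm(1000), b = rnorm(1000))

# Same split recipe as above: 75% sampled without replacement, rest by row names
tr <- d_mock[sample(1:nrow(d_mock), size = 0.75 * nrow(d_mock), replace = FALSE), ]
te <- d_mock[-which(row.names(d_mock) %in% row.names(tr)), ]

nrow(tr); nrow(te)                                # 750 and 250
length(intersect(row.names(tr), row.names(te)))   # 0: no row appears in both
```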

Numerical summary

summary(training)
##     n.rooms          build.year   postalcode square_meters   swimpool_w1km    
##  Min.   :  1.000   Min.   :1900   A1: 70     Min.   : 58.0   Min.   :0.00000  
##  1st Qu.:  2.000   1st Qu.:1961   A2:127     1st Qu.: 98.0   1st Qu.:0.00000  
##  Median :  3.000   Median :1990   A3:115     Median :110.3   Median :0.00000  
##  Mean   :  3.316   Mean   :1982   A4:159     Mean   :112.1   Mean   :0.04133  
##  3rd Qu.:  4.000   3rd Qu.:2007   A5:129     3rd Qu.:123.4   3rd Qu.:0.00000  
##  Max.   :222.000   Max.   :2023   A6: 82     Max.   :211.7   Max.   :1.00000  
##                                   A7: 68                                      
##  school_distance         type     sold_within1week selling_price   
##  Min.   :0.0010   Apartment:637   0:454            Min.   : 32.71  
##  1st Qu.:0.2003   House    :113   1:296            1st Qu.: 55.60  
##  Median :0.4400                                    Median : 65.50  
##  Mean   :0.5853                                    Mean   : 67.88  
##  3rd Qu.:0.7495                                    3rd Qu.: 76.69  
##  Max.   :3.3540                                    Max.   :234.36  
## 

Building multiple linear regression model with training data set to predict selling price.

Here I am leaving out the variable sold_within1week, because for a property that has not yet been sold this information is not available at prediction time.

Equation for model1

selling_price = β0 + β1 x n.rooms + β2 x postalcode + β3 x square_meters + β4 x swimpool_w1km + β5 x build.year + β6 x school_distance + β7 x type

model1 <- lm(data = training, selling_price ~ n.rooms + postalcode + square_meters + swimpool_w1km + build.year + school_distance + type)

summary(model1)
## 
## Call:
## lm(formula = selling_price ~ n.rooms + postalcode + square_meters + 
##     swimpool_w1km + build.year + school_distance + type, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.924  -8.033  -0.755   6.682 109.386 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -26.69385   61.13884  -0.437  0.66252    
## n.rooms           0.06498    0.05654   1.149  0.25076    
## postalcodeA2    -18.76925    1.95860  -9.583  < 2e-16 ***
## postalcodeA3     -6.50875    2.09214  -3.111  0.00194 ** 
## postalcodeA4     -7.69433    2.54359  -3.025  0.00257 ** 
## postalcodeA5     -7.21099    2.76655  -2.606  0.00933 ** 
## postalcodeA6    -14.07870    3.34836  -4.205 2.94e-05 ***
## postalcodeA7    -16.37902    3.37853  -4.848 1.52e-06 ***
## square_meters     0.55644    0.02061  26.992  < 2e-16 ***
## swimpool_w1km     0.10479    2.29233   0.046  0.96355    
## build.year        0.02075    0.03159   0.657  0.51140    
## school_distance   0.13437    0.97437   0.138  0.89035    
## typeHouse         5.49084    1.28118   4.286 2.06e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.36 on 737 degrees of freedom
## Multiple R-squared:  0.5632, Adjusted R-squared:  0.5561 
## F-statistic: 79.19 on 12 and 737 DF,  p-value: < 2.2e-16

Diagnostics of model 1

I am using the plot() function for a quick diagnostics of the model.

The fitted-values-against-residuals diagram indicates that there is some problem with model1: there is structure in the residuals, pointing to non-linearity and non-constant variance. In addition, the qq-plot indicates that the residuals of model1 are not normally distributed.

plot(model1)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Residuals v fitted values of model1

ggplot(model1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

Residuals v explanatory variables for model1

plot(training$n.rooms, model1$residuals, 
     xlab = "Predictor n.rooms", 
     ylab = "residuals") 
abline(h=0)

plot(training$build.year, model1$residuals, 
     xlab = "Predictor build.year", 
     ylab = "residuals") 
abline(h=0)

plot(training$postalcode, model1$residuals, 
     xlab = "Predictor postalcode", 
     ylab = "residuals") 
abline(h=0)

plot(training$square_meters, model1$residuals, 
     xlab = "Predictor square_meters", 
     ylab = "residuals") 
abline(h=0)

plot(training$swimpool_w1km, model1$residuals, 
     xlab = "Predictor swimpool_w1km", 
     ylab = "residuals") 
abline(h=0)

plot(training$school_distance, model1$residuals, 
     xlab = "Predictor school_distance", 
     ylab = "residuals") 
abline(h=0)

plot(training$type, model1$residuals, 
     xlab = "Predictor type", 
     ylab = "residuals") 
abline(h=0)

Normal distribution of model1

qqnorm(model1$res,ylab="Raw Residuals")
qqline(model1$res)

Transformation

Since the diagnostics of the model1 residuals show the model is not appropriate, I am going to examine whether a Box-Cox transformation of the response variable can give a more suitable model than model1.

As the 95% confidence interval for λ includes zero, the Box-Cox transformation of selling_price will be based on λ = 0 for interpretability: selling_price will be transformed to log(selling_price).

boxcox(model1,plotit=T)
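
boxcox() returns the profile log-likelihood over a grid of λ values, so the maximizing λ can also be read off programmatically instead of eyeballing the plot. A sketch on simulated data whose response is log-normal (so the true λ is 0); the data here are made up purely for illustration:

```r
library(MASS)  # for boxcox()

# Simulated illustration: log(y) is linear in x, so the Box-Cox lambda
# should come out close to 0.
set.seed(1)
x <- runif(300)
y <- exp(1 + 0.5 * x + rnorm(300, sd = 0.3))

fit <- lm(y ~ x)
bc  <- boxcox(fit, plotit = FALSE)     # list with lambda grid (x) and log-likelihood (y)
lambda_hat <- bc$x[which.max(bc$y)]    # lambda maximizing the profile likelihood
lambda_hat
```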

Equation for model2

model2 <- lm(data = training, log(selling_price) ~ n.rooms + postalcode + square_meters + swimpool_w1km + build.year + school_distance + type)

summary(model2)
## 
## Call:
## lm(formula = log(selling_price) ~ n.rooms + postalcode + square_meters + 
##     swimpool_w1km + build.year + school_distance + type, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47119 -0.11095 -0.00113  0.10966  0.52977 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.1312046  0.8235767   3.802 0.000155 ***
## n.rooms          0.0007599  0.0007616   0.998 0.318696    
## postalcodeA2    -0.2874152  0.0263835 -10.894  < 2e-16 ***
## postalcodeA3    -0.0902821  0.0281824  -3.203 0.001416 ** 
## postalcodeA4    -0.0944001  0.0342637  -2.755 0.006012 ** 
## postalcodeA5    -0.0839536  0.0372671  -2.253 0.024568 *  
## postalcodeA6    -0.2069251  0.0451044  -4.588 5.27e-06 ***
## postalcodeA7    -0.2285115  0.0455108  -5.021 6.45e-07 ***
## square_meters    0.0074029  0.0002777  26.659  < 2e-16 ***
## swimpool_w1km   -0.0017130  0.0308791  -0.055 0.955775    
## build.year       0.0001736  0.0004255   0.408 0.683367    
## school_distance  0.0101807  0.0131254   0.776 0.438203    
## typeHouse        0.0759985  0.0172583   4.404 1.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1665 on 737 degrees of freedom
## Multiple R-squared:  0.5728, Adjusted R-squared:  0.5658 
## F-statistic: 82.34 on 12 and 737 DF,  p-value: < 2.2e-16

Diagnostics 2

I examine the residuals against the fitted values again with a diagram.

There seems to be some improvement in model2: there is no indication of structure in the residuals or of non-constant variance, and the residuals seem approximately normally distributed in the normal qq plot. However, there is one observation with a higher leverage than the rule of thumb suggested by Faraway (threshold = 0.032).

I will investigate further with the jackknife residuals.

plot(model2)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Residuals v fitted for model2

ggplot(model2, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

Plotting residuals against predictors for model2

plot(training$n.rooms, model2$residuals, 
     xlab = "Predictor n.rooms", 
     ylab = "residuals") 
abline(h=0)

plot(training$build.year, model2$residuals, 
     xlab = "Predictor build.year", 
     ylab = "residuals") 
abline(h=0)

plot(training$postalcode, model2$residuals, 
     xlab = "Predictor postalcode", 
     ylab = "residuals") 
abline(h=0)

plot(training$square_meters, model2$residuals, 
     xlab = "Predictor square_meters", 
     ylab = "residuals") 
abline(h=0)

plot(training$swimpool_w1km, model2$residuals, 
     xlab = "Predictor swimpool_w1km", 
     ylab = "residuals") 
abline(h=0)

plot(training$school_distance, model2$residuals, 
     xlab = "Predictor school_distance", 
     ylab = "residuals") 
abline(h=0)

plot(training$type, model2$residuals, 
     xlab = "Predictor type", 
     ylab = "residuals") 
abline(h=0)

Normal distribution of residuals for model2

qqnorm(model2$res,ylab="Raw Residuals")
qqline(model2$res)

Jackknife residual

Looking at the jackknife residuals, the largest in absolute value is -9.338425 and the smallest is -0.0008467366.

Again, the Bonferroni-corrected significance threshold is ±4.013996; any observation whose jackknife residual exceeds this in absolute value is a candidate outlier.

jack_training <- rstudent(model2)
plot(jack_training,ylab="Jacknife Residuals of training",main="Jacknife Residuals")
abline(h=0)

jack_training[abs(jack_training)==max(abs(jack_training))]
##       104 
## -9.338425
jack_training[abs(jack_training)==min(abs(jack_training))]
##           894 
## -0.0008467366
range(rstudent(model2))
## [1] -9.338425  3.284627
qt (0.05/(750*2),649)
## [1] -4.013996
training$jack <- rstudent(model2)
threshold=qt(0.05/(nrow(training)*2),lower.tail=FALSE,df=(nrow(training)-length(model2$coefficients)-1))
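
The comparison the threshold above is meant for can be sketched end to end on simulated data: compute the jackknife (studentized deleted) residuals, compute the Bonferroni-corrected cut-off, and flag observations exceeding it. The data and the injected outlier below are made up for illustration:

```r
# Simulated illustration: one gross outlier injected at row 50.
set.seed(42)
n <- 100
x <- seq_len(n)
y <- 2 + 0.5 * x + rnorm(n)
y[50] <- y[50] + 15

fit  <- lm(y ~ x)
jack <- rstudent(fit)                       # jackknife residuals
p    <- length(coef(fit))
# Two-sided Bonferroni cut-off at overall level 0.05
thr  <- qt(0.05 / (2 * n), df = n - p - 1, lower.tail = FALSE)
flagged <- which(abs(jack) > thr)           # row 50 is among the flagged points
```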

Cook's distance

I now turn my attention to Cook's distance to evaluate the model. If an observation's Cook's distance is greater than 4/n, I will treat it as an outlier.

Setting the threshold t:

t = 4/nrow(training)
cooks_distance <- cooks.distance(model1)
plot(cooks_distance)

Several observations exceed the Cook's distance threshold in this model, hence I am removing them.

outliers <- which(cooks_distance>t)
print(outliers)
## 780 907 595 468 310 662  15 376 322 814 416 793 165 429 104 461 499 452 597 471 
##  25  64 101 106 131 141 167 193 214 230 239 348 356 358 429 446 448 466 503 523 
## 684 148 835 798 792 109 
## 524 559 566 589 599 734
training_clean <- training[-c(outliers),]
cat("\nFull dataset records:     ",nrow(training))
## 
## Full dataset records:      750
cat("\nFiltered dataset records: ", nrow(training_clean))
## 
## Filtered dataset records:  724

Refitting a new model after removal of outliers

Equation for model3

selling_price = β0 + β1 x n.rooms + β2 x postalcode + β3 x square_meters + β4 x swimpool_w1km + β5 x build.year + β6 x school_distance + β7 x type

model3 <- lm(data = training_clean , selling_price ~ n.rooms + postalcode + square_meters + swimpool_w1km + build.year + school_distance + type)

summary(model3)
## 
## Call:
## lm(formula = selling_price ~ n.rooms + postalcode + square_meters + 
##     swimpool_w1km + build.year + school_distance + type, data = training_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.427  -6.790  -0.683   6.726  28.835 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -22.16626   50.14171  -0.442 0.658571    
## n.rooms           2.92721    0.35627   8.216 9.92e-16 ***
## postalcodeA2    -18.98089    1.60518 -11.825  < 2e-16 ***
## postalcodeA3     -7.98700    1.72137  -4.640 4.15e-06 ***
## postalcodeA4     -8.12618    2.08430  -3.899 0.000106 ***
## postalcodeA5     -6.51359    2.26551  -2.875 0.004160 ** 
## postalcodeA6    -16.18358    2.74553  -5.895 5.80e-09 ***
## postalcodeA7    -17.56467    2.76445  -6.354 3.75e-10 ***
## square_meters     0.35690    0.02415  14.775  < 2e-16 ***
## swimpool_w1km    -1.84791    1.93413  -0.955 0.339690    
## build.year        0.02524    0.02587   0.976 0.329459    
## school_distance   0.91964    0.79663   1.154 0.248715    
## typeHouse         4.60351    1.08543   4.241 2.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.891 on 711 degrees of freedom
## Multiple R-squared:  0.6159, Adjusted R-squared:  0.6094 
## F-statistic:    95 on 12 and 711 DF,  p-value: < 2.2e-16

Diagram for model 3

Again, there is some improvement in model3: there is no indication of structure in the residuals, and the residuals seem approximately normally distributed in the normal qq plot.

plot(model3)

Residuals v fitted for model3

ggplot(model3, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

Plotting residuals against all explanatory variables

From the plots I saw much improvement for all variables, and there was nothing suspicious to detect compared to the first two models.

plot(training_clean$n.rooms, model3$residuals, 
     xlab = "Predictor n.rooms", 
     ylab = "residuals") 
abline(h=0)

plot(training_clean$build.year, model3$residuals, 
     xlab = "Predictor build.year", 
     ylab = "residuals") 
abline(h=0)

plot(training_clean$postalcode, model3$residuals, 
     xlab = "Predictor postalcode", 
     ylab = "residuals") 
abline(h=0)

plot(training_clean$square_meters, model3$residuals, 
     xlab = "Predictor square_meters", 
     ylab = "residuals") 
abline(h=0)

plot(training_clean$swimpool_w1km, model3$residuals, 
     xlab = "Predictor swimpool_w1km", 
     ylab = "residuals") 
abline(h=0)

plot(training_clean$school_distance, model3$residuals, 
     xlab = "Predictor school_distance", 
     ylab = "residuals") 
abline(h=0)

plot(training_clean$type, model3$residuals, 
     xlab = "Predictor type", 
     ylab = "residuals") 
abline(h=0)

Assessing the normal distribution of residuals for model 3

The residuals seem to be approximately normally distributed when examining the normal qq plot.

qqnorm(model3$res,ylab="Raw Residuals")
qqline(model3$res)

Variable selection

I am going to build a regression model from the full set of predictor variables by removing predictors with p-values greater than 0.05. In this case, the variables swimpool_w1km, build.year and school_distance were dropped from the model.

Despite its slightly lower R² value, I believe there is not much difference between model4 and model3, and hence I prefer model4 as my final model, since all its variables are significant at the 0.05 alpha level.

Equation of my final model, model4

selling_price = 27.11485 + 2.91387(n.rooms) - 18.53843(postalcodeA2) - 7.35384(postalcodeA3) - 6.78113(postalcodeA4) - 5.10811(postalcodeA5) - 13.48789(postalcodeA6) - 14.97878(postalcodeA7) + 0.35708(square_meters) + 4.71228(typeHouse)

Estimate of residual standard error

The estimate of the residual standard error in my final model is 9.892.

model4 <- lm(data = training_clean , selling_price ~ n.rooms + postalcode + square_meters + type)

summary(model4)
## 
## Call:
## lm(formula = selling_price ~ n.rooms + postalcode + square_meters + 
##     type, data = training_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.6147  -6.8596  -0.7441   6.7416  29.0273 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    27.11485    2.38072  11.389  < 2e-16 ***
## n.rooms         2.91387    0.35533   8.201 1.11e-15 ***
## postalcodeA2  -18.53843    1.51180 -12.262  < 2e-16 ***
## postalcodeA3   -7.35384    1.54641  -4.755 2.40e-06 ***
## postalcodeA4   -6.78113    1.45591  -4.658 3.81e-06 ***
## postalcodeA5   -5.10811    1.50236  -3.400 0.000711 ***
## postalcodeA6  -13.48789    1.65017  -8.174 1.36e-15 ***
## postalcodeA7  -14.97878    1.72248  -8.696  < 2e-16 ***
## square_meters   0.35708    0.02409  14.823  < 2e-16 ***
## typeHouse       4.71228    1.08323   4.350 1.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.892 on 714 degrees of freedom
## Multiple R-squared:  0.6142, Adjusted R-squared:  0.6093 
## F-statistic: 126.3 on 9 and 714 DF,  p-value: < 2.2e-16
summary(model4)$r.squared
## [1] 0.6141888
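
As a cross-check on this manual p-value-based selection, R's step() performs the same kind of backward elimination but uses AIC as the criterion (a different rule, so it will not always drop the same variables). A minimal sketch on simulated data, where x2 is pure noise:

```r
# Simulated illustration: x1 matters, x2 is pure noise.
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

full <- lm(y ~ x1 + x2)
sel  <- step(full, trace = 0)   # backward elimination by AIC
names(coef(sel))                # the strong predictor x1 is retained
```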

Diagram for model4

There is a good fit for the residuals of model4: there is no indication of structure in the residuals diagram, and the residuals seem approximately normally distributed in the normal qq plot.

plot(model4)

Residuals v fitted

ggplot(model4, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

Assessing the normal distribution of residuals

qqnorm(model4$res,ylab="Raw Residuals")
qqline(model4$res)

Diagram of observed v predicted selling price in the test data set

From the diagram it can be seen that there is a positive relationship between the observed and the predicted selling prices in the test data.

model.test <- lm(data = test , selling_price ~ n.rooms + postalcode + square_meters + type)


plot_data <- data.frame(Predicted_value = predict(model.test),   
                       Observed_value = test$selling_price) 

ggplot(plot_data, aes(x = Predicted_value, y = Observed_value)) + 
                  geom_point() + 
                 geom_abline(intercept = 0, slope = 1, color = "green")

The RMSE for model4, for test data

The root mean square error for the test data is about 10.4 million ISK.

model.test <- lm(data = test , selling_price ~ n.rooms + postalcode + square_meters + type)
pred_selling_price <- predict(model.test, test)
test %>% summarize(RMSE(pred_selling_price, selling_price))
##   RMSE(pred_selling_price, selling_price)
## 1                                10.35646
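
caret's RMSE(pred, obs) computes the square root of the mean squared difference, so the result above can be verified with a one-line hand-rolled version; the RMSE is on the same scale as selling_price, i.e. millions of ISK:

```r
# Hand-rolled equivalent of caret::RMSE, as a sanity check
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))
rmse(c(1, 2, 3), c(1, 2, 7))   # sqrt(mean(c(0, 0, 16))) = sqrt(16/3), about 2.31
```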

The RMSE for model3, for test data

The root mean square error for the test data is about 10.3 million ISK.

model.test2 <- lm(data = test , selling_price ~ n.rooms + postalcode + square_meters + swimpool_w1km + build.year + school_distance + type)
pred_selling_price.m.t.2 <- predict(model.test2, test)
test %>% summarize(RMSE(pred_selling_price.m.t.2, selling_price))
##   RMSE(pred_selling_price.m.t.2, selling_price)
## 1                                       10.2935